Introduction

So you want to be a Data Scientist and don’t know where to start? Well, you’ve come to the right place.

Today you’ll learn a little about the economy, a whole bunch of data science principles and practices, and where to stay on your next trip to New York City!

Data Science Pipeline

First, let’s look at the data science pipeline.

The data science pipeline starts with defining what major questions one wants to answer and subsequently acquiring and importing the relevant data to be analyzed.

Next, the data is viewed and tidied. Tidy data assumes a rectangular data structure and must meet three requirements: each observation (called an entity) forms a row, each variable (called an attribute) forms a column, and each observational unit (type of entity) forms a table.

This leads to the exploratory data analysis process, where the data is transformed and visualized. Data cleaning may be necessary when data is missing. Missing data can be removed, encoded, or imputed (for a numeric variable, for example, by replacing missing values with the mean of the non-missing values).
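The three ways of handling missing data mentioned above can be sketched in a few lines of base R. The price vector here is made up for illustration and is not taken from the Airbnb dataset.

```r
# Toy numeric variable with two missing values (hypothetical data)
price <- c(100, NA, 150, 200, NA)

# Option 1: remove the missing values
price_removed <- price[!is.na(price)]

# Option 2: encode missingness as its own logical indicator
price_missing <- is.na(price)

# Option 3: impute with the mean of the non-missing values
price_imputed <- ifelse(is.na(price), mean(price, na.rm = TRUE), price)

print(price_removed)
print(price_missing)
print(price_imputed)
```

Which option is appropriate depends on why the values are missing, so treat this as a starting point rather than a rule.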

Hypothesis testing and machine learning (ML) modeling are the final steps before the data and its results can be communicated.

Image of Pipeline

Source: https://r4ds.had.co.nz/explore-intro.html

Economy & Vacations

The rise of companies like Airbnb has created the sharing economy, a model defined as the facilitation of goods and services on a peer-to-peer level, usually through online community platforms. This new model has made it possible for a great many people to gain another source of income and for you to have an affordable vacation.

As more sharing economy companies have opened, like Airbnb and Uber, the way we vacation has changed. This change has been documented and open data on it is available.

DataSet Used

The data we will be using in this tutorial is New York City Airbnb Open Data from Kaggle. We will use this data to look at the relationships between types of housing and location.

Preparing Data

Download the dataset.

In this section, we will learn how to load in our dataset, view the data in our dataset, and clean it up so it’s easy for us to work with.

First, let’s load in the following libraries so we can use certain functions:

# for data wrangling
library(tidyverse)
library(dplyr)

# for data analysis
library(geosphere)
library(ggplot2)
library(broom)

Loading Data

CSV stands for “comma-separated values”: a CSV file stores data as plain text, with the values in each row separated by commas.

After we’ve downloaded our CSV file from Kaggle into our working directory, we can use the read.csv function to load the data into a data frame, which is a table of the data.

# create a dataframe from our CSV file
airbnb_tab <- read.csv("AB_NYC_2019.csv", header=TRUE)
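To make the CSV format concrete, here is a minimal sketch that parses a tiny comma-separated string with the same read.csv function. The three rows are invented for illustration; only the column names echo the real dataset.

```r
# A tiny CSV written inline (hypothetical rows, not the Kaggle file)
csv_text <- "id,room_type,price
1,Private room,149
2,Entire home/apt,225"

# read.csv() accepts raw text through its `text` argument
toy_tab <- read.csv(text = csv_text, header = TRUE)

# Inspect the column names and types that were inferred
str(toy_tab)
```

The first line of the text becomes the column names because header=TRUE, exactly as with the file on disk.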

There are some attributes that we don’t need for our purposes, like host_id, host_name, minimum_nights, number_of_reviews, last_review, reviews_per_month, and calculated_host_listings_count. So, let’s remove these from our data frame:

# a vector called "to_remove" that has the names of the attributes we don't want
to_remove <- c('host_id', 
              'host_name', 
               'minimum_nights', 
               'number_of_reviews', 
               'last_review', 
               'reviews_per_month', 
               'calculated_host_listings_count')

# removing attributes from data frame using "to_remove"
airbnb_tab = airbnb_tab[ , !(names(airbnb_tab) %in% to_remove)]
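The same column removal can also be written with dplyr, which is already loaded for this tutorial. This is an alternative sketch on a small made-up data frame, not a replacement for the base R approach shown here.

```r
library(dplyr)

# Small stand-in data frame (hypothetical values)
toy_tab <- data.frame(
  id = 1:2,
  host_id = c(10, 20),
  host_name = c("A", "B"),
  price = c(149, 225)
)

# Names of the attributes we don't want
to_remove <- c("host_id", "host_name")

# all_of() tells select() to treat the character vector as column names
toy_tab <- toy_tab %>%
  select(-all_of(to_remove))

names(toy_tab)
```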

Viewing Data

Here, we see the first 10 rows in our dataset:

knitr::kable(head(airbnb_tab, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194
5022 Entire Apt: Spacious Studio/Loft by central park Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 0
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129
5121 BlissArtsSpace! Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 0
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220
5203 Cozy Clean Guest Room - Family Apt Manhattan Upper West Side 40.80178 -73.96723 Private room 79 0
5238 Cute & Cozy Lower East Side 1 bdrm Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 188

Some Notes:

  • knitr::kable() is used to make the table “pretty” and easier to read

  • head(df, n=10) is used to view the dataframe with a specific number of rows (head() is not always necessary, you can just list the data frame for it to render)

  • df is where the dataframe goes, in this case airbnb_tab

  • n = determines the number of rows visible, in this case 10

The following is a list of descriptions for the attributes of our data set:

Attribute Description/Unit
id Unique ID for each Airbnb listing
name Name or description of the Airbnb listing
neighbourhood_group Boroughs of New York (Manhattan, Brooklyn, Queens, Bronx, Staten Island)
neighbourhood Neighborhoods of New York
latitude Degrees of latitude, measures position north or south of the Equator
longitude Degrees of longitude, measures position east or west of the Prime Meridian
room_type Type of space offered (Entire home/apt, Private room, Shared room)
price Price of listing, in US Dollars
availability_365 Number of days in a year when the listing is available for booking

Tidying Data

A tidy dataset has the following three elements:

  1. Each observation/entity forms a row
  2. Each variable/attribute forms a column
  3. Each observational unit (type of entity) forms a table (i.e. not dependent on one another)

Our dataset is already tidy and meets the criteria above. Each entity is a row and each attribute is a column, where no entity is dependent on another.

However, if your data set is untidy, below is an example on a different small dataset, to show you what to do.

Sample Tidying
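As a minimal sketch of tidying, assume a small table where the year is spread across column names instead of being stored as a variable. The data frame and its values are made up for illustration. tidyr's pivot_longer gathers those columns so that each row is one observation.

```r
library(tidyr)

# Untidy: one column per year, so "year" is hidden in the column names
untidy <- data.frame(
  borough = c("Brooklyn", "Manhattan"),
  price_2018 = c(110, 180),
  price_2019 = c(120, 195)
)

# Tidy: one row per (borough, year) observation
tidy_tab <- pivot_longer(
  untidy,
  cols = c(price_2018, price_2019),
  names_to = "year",
  names_prefix = "price_",  # strip the prefix so only the year remains
  values_to = "price"
)

print(tidy_tab)
```

After the reshape, borough, year, and price are each a column, and each row is a single observation.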

Exploratory Data Analysis

In this section, we begin exploring what our data can tell us using visualizations. This will help us to better understand our data and help us make decisions about how we may want to further manipulate the data to see something specific, or decide which methods are best for modelling and Machine Learning!

The main reason for exploratory data analysis, or EDA, is to help us find any problems in our data preparation and gain a sense of variable properties, such as central trends (mean), spread (variance), skew, outliers, and relationships between pairs of variables, like their correlation or covariance.
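The properties listed above map directly onto base R functions. The sketch below uses two short made-up vectors standing in for price and availability_365.

```r
# Hypothetical values standing in for two numeric attributes
price <- c(60, 79, 80, 89, 150, 225)
availability <- c(0, 220, 0, 194, 365, 355)

stats <- list(
  mean = mean(price),          # central trend
  median = median(price),      # central trend, robust to outliers
  variance = var(price),       # spread
  correlation = cor(price, availability),  # relationship between a pair
  covariance = cov(price, availability)
)

str(stats)
```

A mean well above the median, as with these toy prices, is one quick hint that the distribution is skewed to the right.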

You can read more about EDA at CMSC 320 EDA Lecture Notes by Professor Hector Corrada Bravo.

Handling Missing Data

Recall that the attribute availability_365 tells us how many days in the year that this particular listing is available for people to book.

Notice that 0 is a value for some of the entities (Airbnb listings). It doesn’t make much sense for us to look at entities that aren’t available at all during the year. In fact, more than 17000 entities are listed as being available for 0 days out of the year! That’s about 1/3 of our dataset.

We’ll call this “missing data”, and remove these entities from our dataset:

airbnb_tab <- airbnb_tab %>%
  filter(availability_365 > 0) # filter() is used to filter the dataframe via specific conditions

knitr::kable(head(airbnb_tab, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220
5238 Cute & Cozy Lower East Side 1 bdrm Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 188
5295 Beautiful 1br on Upper West Side Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 6
5441 Central Manhattan/near Broadway Manhattan Hell’s Kitchen 40.76076 -73.98867 Private room 85 39
5803 Lovely Room 1, Garden, Best Area, Legal rental Brooklyn South Slope 40.66829 -73.98779 Private room 89 314

Note that one way to handle missing data, as mentioned in the data science pipeline section (data cleaning), is to remove it altogether. Having 0 as a value for availability_365 is a form of missing data.
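Before dropping rows, it can help to count how many are affected and to see what the alternative of encoding them would look like. This sketch uses a six-row made-up data frame rather than the real listings.

```r
# Hypothetical listings; two of them are never available
toy_tab <- data.frame(
  id = 1:6,
  availability_365 = c(365, 0, 194, 0, 129, 220)
)

# How many rows would be removed, and what share of the data is that?
n_zero <- sum(toy_tab$availability_365 == 0)
share_zero <- n_zero / nrow(toy_tab)

# Option 1: remove the rows (the approach used in this tutorial)
toy_removed <- toy_tab[toy_tab$availability_365 > 0, ]

# Option 2: keep the rows but encode 0 as NA
toy_encoded <- toy_tab
toy_encoded$availability_365[toy_encoded$availability_365 == 0] <- NA

print(n_zero)
print(share_zero)
```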

Visualizations

Interactive Map

library(leaflet)

# Creating NYC Map
nyc_map <- leaflet(airbnb_tab) %>%
  addTiles() %>%
  setView(
    lat=40.730610, 
    lng=-73.935242, 
    zoom=11)

nyc_map
leaflet(airbnb_tab) %>% 
  addTiles() %>%
    addAwesomeMarkers(
      lng = ~longitude, 
      lat = ~latitude,
      icon = awesomeIcons(
              icon = 'ios-close',
              iconColor = 'black',
              library = 'ion',
              markerColor = ~ifelse(room_type == 'Entire home/apt', "green", 
                                    ifelse(room_type =='Private room', "orange", 
                                           "red"
                                    )
                            )
            ),
      
      ## Price Label
      label=~as.character(price),
    
      ## Clustering for identifying listing density
      clusterOptions = markerClusterOptions()
    ) %>%
  addLegend(
    position = 'bottomright', 
    colors= c("green", "orange", "red"), labels=c("Entire Home/Apt", "Private Room", "Shared Room"), 
    title='Types of Rentals'
  )

Box Plots

library(ggplot2)
library(ggthemes)

airbnb_home <- airbnb_tab %>%
  filter(room_type == 'Entire home/apt')

knitr::kable(head(airbnb_home, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129
5238 Cute & Cozy Lower East Side 1 bdrm Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 188
5295 Beautiful 1br on Upper West Side Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 6
6848 Only 2 stops to Manhattan studio Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 46
7097 Perfect for Your Parents + Garden Brooklyn Fort Greene 40.69169 -73.97185 Entire home/apt 215 321
7726 Hip Historic Brownstone Apartment with Backyard Brooklyn Crown Heights 40.67592 -73.94694 Entire home/apt 99 21
7750 Huge 2 BR Upper East Cental Park Manhattan East Harlem 40.79685 -73.94872 Entire home/apt 190 249
8490 MAISON DES SIRENES1,bohemian apartment Brooklyn Bedford-Stuyvesant 40.68371 -73.94028 Entire home/apt 120 233
airbnb_home %>%
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+
  coord_flip() +
  theme_economist() + 
  scale_fill_economist() +
  labs(title = "Entire Homes & Apts. Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

airbnb_home %>%
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+
  scale_y_continuous(limits = c(0, 1500)) +
  coord_flip() +
  theme_economist() + 
  scale_fill_economist() +
  labs(title = "2019 NYC Homes & Apts. Prices (Up to $1500/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

airbnb_room <- airbnb_tab %>%
  filter(room_type == 'Private room')

knitr::kable(head(airbnb_room, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220
5441 Central Manhattan/near Broadway Manhattan Hell’s Kitchen 40.76076 -73.98867 Private room 85 39
5803 Lovely Room 1, Garden, Best Area, Legal rental Brooklyn South Slope 40.66829 -73.98779 Private room 89 314
6021 Wonderful Guest Bedroom in Manhattan for SINGLES Manhattan Upper West Side 40.79826 -73.96113 Private room 85 333
7322 Chelsea Perfect Manhattan Chelsea 40.74192 -73.99501 Private room 140 12
8024 CBG CtyBGd HelpsHaiti rm#1:1-4 Brooklyn Park Slope 40.68069 -73.97706 Private room 130 347
8025 CBG Helps Haiti Room#2.5 Brooklyn Park Slope 40.67989 -73.97798 Private room 80 364
8110 CBG Helps Haiti Rm #2 Brooklyn Park Slope 40.68001 -73.97865 Private room 110 304
airbnb_room %>%
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+
  coord_flip() +
  theme_economist() + 
  scale_fill_economist() +
  labs(title = "Private Room Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

airbnb_room %>%
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+
  scale_y_continuous(limits = c(0, 500)) +
  coord_flip() +
  theme_economist() + 
  scale_fill_economist() +
  labs(title = "2019 NYC Private Room Prices (Up to $500/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

airbnb_sroom <- airbnb_tab %>%
  filter(room_type == 'Shared room')

knitr::kable(head(airbnb_sroom, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
12048 LowerEastSide apt share shortterm 1 Manhattan Lower East Side 40.71401 -73.98917 Shared room 40 188
54453 MIDTOWN WEST - Large alcove studio Manhattan Hell’s Kitchen 40.76548 -73.98474 Shared room 105 363
173072 Cozy Pre-War Harlem Apartment Manhattan Harlem 40.80827 -73.95329 Shared room 49 248
391948 Single Room Queens Ozone Park 40.68581 -73.84642 Shared room 45 364
467634 yahmanscrashpads Queens Jamaica 40.67747 -73.76493 Shared room 39 353
564751 Artist space for creative nomads. Manhattan Upper West Side 40.80165 -73.96287 Shared room 76 324
737126 Williamsburg Loft!! Bedford L 1blk! Brooklyn Williamsburg 40.71714 -73.95447 Shared room 195 364
765203 Art Lover’s Abode Brooklyn Brooklyn Williamsburg 40.70745 -73.94307 Shared room 52 88
773497 Great spot in Brooklyn Brooklyn Bedford-Stuyvesant 40.69407 -73.94551 Shared room 200 365
819206 Cute shared studio apartment Manhattan East Harlem 40.79106 -73.95058 Shared room 45 313
airbnb_sroom %>%
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+
  coord_flip() +
  theme_economist() + 
  scale_fill_economist() +
  labs(title = "Shared Room Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

airbnb_sroom %>%
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+
  scale_y_continuous(limits = c(0, 200)) +
  coord_flip() +
  theme_economist() + 
  scale_fill_economist() +
  labs(title = "2019 NYC Shared Room Prices (Up to $200/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")
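One detail worth knowing about the capped plots: limits set through scale_y_continuous() drop the values outside the range before the box plot statistics are computed, whereas a coordinate-level zoom (coord_cartesian(), or the xlim/ylim arguments of coord_flip()) keeps every row and only changes the view. The base R sketch below, on made-up prices, shows how much the median can move when the expensive listings are dropped.

```r
# Hypothetical nightly prices with two very expensive listings
prices <- c(50, 80, 100, 120, 150, 2000, 3000)

# Median over all listings (what a zoomed plot would still show)
median_all <- median(prices)

# Median after dropping listings above $1500 (what scale limits do)
median_trimmed <- median(prices[prices <= 1500])

print(median_all)
print(median_trimmed)
```

Either behavior can be what you want, as long as the plot title makes clear which listings are included.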

Hypothesis Testing & Machine Learning

Linear Regression

With datasets that are large, it can be very useful to generate a linear regression, or a line of “best fit”, for an easier interpretation of the data. This data analysis technique is also an effective way to learn about general trends of our data set and lets us construct confidence intervals and do hypothesis testing, which analyzes and tests for relationships between variables.

We want to look at the relationship between price and distance away from Times Square in New York City, the most populous city in the United States. We are looking at Times Square since it is a major commercial intersection, tourist destination, entertainment center, and neighborhood in the Midtown Manhattan section of NYC (Wikipedia).

For these reasons, we would like to see if Airbnb listing prices increase as their distance to Times Square (latitude 40.757, longitude -73.986) decreases, and vice versa. We will be using functions from the geosphere library to calculate distance between coordinates.

First, let’s add an attribute called distToTimesSquare in our dataset. This will contain the distance (in miles) between each listing and Times Square.

# Times Square as (longitude, latitude), the order geosphere expects
coordsTimeSquare <- c(-73.986, 40.757)

# distHaversine() is vectorized, so all listings can be passed as a two-column matrix
airbnb_tab <- airbnb_tab %>%
  mutate(distToTimesSquare = distHaversine(cbind(longitude, latitude),
                                           coordsTimeSquare) / 1609) # divide by 1609 to convert meters to miles (approximately)

knitr::kable(head(airbnb_tab))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365 distToTimesSquare
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365 7.6101584
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355 0.2614253
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365 4.2767099
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194 5.1585482
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129 0.8654731
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220 0.5487460
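As a sanity check on the distances above, the haversine formula can be written out by hand in base R. This is a sketch, not geosphere's implementation; the helper name haversine_m is made up, and the radius of 6378137 meters is assumed to match the default used by distHaversine.

```r
# Great-circle distance in meters between two (lon, lat) points in degrees.
# haversine_m is a hypothetical helper written for illustration.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))  # pmin guards against rounding above 1
}

# Listing 2595 (Skylit Midtown Castle) to Times Square
d_m <- haversine_m(-73.98377, 40.75362, -73.986, 40.757)
d_miles <- d_m / 1609  # same approximate conversion as in the tutorial

print(d_miles)
```

The result should land close to the roughly 0.26 miles shown for that listing in the table.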

Second, let’s split our current airbnb_tab data frame into three data frames, one for each room_type: "Entire home/apt", "Private room", and "Shared room". This is because prices are much higher for “Entire home/apt” listings, so we don’t want to get confused when regressing against distance. We only want to see the relation between distance and prices, not between prices and size of the space being listed!

# create new dataframe of listings where room_type=="Entire home/apt"
entire_tab <- airbnb_tab %>%
  filter(room_type == "Entire home/apt")

# create new dataframe of listings where room_type=="Private room"
private_tab <- airbnb_tab %>%
  filter(room_type == "Private room")

# create new dataframe of listings where room_type=="Shared room"
shared_tab <- airbnb_tab %>%
  filter(room_type == "Shared room")

knitr::kable(head(entire_tab))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365 distToTimesSquare
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355 0.2614253
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194 5.1585482
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129 0.8654731
5238 Cute & Cozy Lower East Side 1 bdrm Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 188 3.0224159
5295 Beautiful 1br on Upper West Side Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 6 3.3701851
6848 Only 2 stops to Manhattan studio Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 46 3.7708536
knitr::kable(head(private_tab))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365 distToTimesSquare
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365 7.610158
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365 4.276710
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220 0.548746
5441 Central Manhattan/near Broadway Manhattan Hell’s Kitchen 40.76076 -73.98867 Private room 85 39 0.295381
5803 Lovely Room 1, Garden, Best Area, Legal rental Brooklyn South Slope 40.66829 -73.98779 Private room 89 314 6.138165
6021 Wonderful Guest Bedroom in Manhattan for SINGLES Manhattan Upper West Side 40.79826 -73.96113 Private room 85 333 3.137898
knitr::kable(head(shared_tab))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365 distToTimesSquare
12048 LowerEastSide apt share shortterm 1 Manhattan Lower East Side 40.71401 -73.98917 Shared room 40 188 2.978924
54453 MIDTOWN WEST - Large alcove studio Manhattan Hell’s Kitchen 40.76548 -73.98474 Shared room 105 363 0.590397
173072 Cozy Pre-War Harlem Apartment Manhattan Harlem 40.80827 -73.95329 Shared room 49 248 3.939358
391948 Single Room Queens Ozone Park 40.68581 -73.84642 Shared room 45 364 8.821836
467634 yahmanscrashpads Queens Jamaica 40.67747 -73.76493 Shared room 39 353 12.832089
564751 Artist space for creative nomads. Manhattan Upper West Side 40.80165 -73.96287 Shared room 76 324 3.318301

Third, we want to create a scatter plot of the prices of listings against their distance to Times Square. We’ll also add a regression line to this scatter plot to show the general increasing or decreasing trend in our data! Let’s do this three times, once for each room_type we are interested in.

entire_tab %>%
    ggplot(aes(x=distToTimesSquare, y=price)) + # refer to columns by name, not with $
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 1500) + # set the upper limit of prices to $1500
    labs(title="Homes & Apts. Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")

private_tab %>%
    ggplot(aes(x=distToTimesSquare, y=price)) +
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 500) + # set the upper limit of prices to $500
    labs(title="Private Room Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")

shared_tab %>%
    ggplot(aes(x=distToTimesSquare, y=price)) +
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 200) + # set the upper limit of prices to $200
    labs(title="Shared Room Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")

Lastly, let’s analyze the resulting models quantitatively using broom::tidy. Note that the formula distToTimesSquare~price makes distance the response and price the predictor.

entire_fit <- lm(distToTimesSquare~price, data=entire_tab)
entire_fit %>%
  tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  4.43    0.0284        156.  0.      
## 2 price       -0.00149 0.0000760     -19.6 6.45e-85
private_fit <- lm(distToTimesSquare~price, data=private_tab)
private_fit %>%
  tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  5.44     0.0277       197.  0.      
## 2 price       -0.00237  0.000141     -16.8 7.26e-63
shared_fit <- lm(distToTimesSquare~price, data=shared_tab)
shared_fit %>%
  tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  5.41      0.128       42.3  2.95e-212
## 2 price       -0.00555   0.00109     -5.11 3.93e-  7

As we can see in all three of these linear regression plots, the prices of all types of listings decrease slowly as the location of the listing gets farther away from Times Square. Because the fitted models regress distance on price, each slope is measured in miles per dollar: on average, each additional dollar in price corresponds to a listing that is 0.00149 (homes and apts), 0.00237 (private rooms), or 0.00555 (shared rooms) miles closer to Times Square.
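The direction of the formula decides what a slope means. To estimate the change in price per mile directly, price would go on the left-hand side. The sketch below fits that model on a small synthetic data frame, so its coefficients say nothing about the real listings.

```r
# Synthetic listings (made-up values for illustration only)
toy_tab <- data.frame(
  distToTimesSquare = c(0.5, 1, 2, 3, 5, 8),
  price = c(300, 250, 220, 180, 150, 100)
)

# Response on the left, predictor on the right:
# the slope is now in dollars per mile
price_fit <- lm(price ~ distToTimesSquare, data = toy_tab)

slope <- coef(price_fit)[["distToTimesSquare"]]
print(slope)
```

On the real data, the same formula could be applied to entire_tab, private_tab, and shared_tab and passed to tidy() as before.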

Even though we can clearly see a trend in our linear regressions, it is best to conduct hypothesis testing in order to determine whether there is a statistically significant relationship between Airbnb prices and their distance from high-traffic locations, such as Times Square in New York City (Statistics How To).

Let’s ask the question: Do we reject the null hypothesis of no relationship between price and distance from Times Square?

Our answer: Yes, we reject the null hypothesis since the p-values for all three linear regressions are far smaller than 0.05. A p-value less than or equal to 0.05 means that, if there were truly no relationship, we would see an association at least this strong less than 5% of the time, so our results are unlikely to be due to chance alone (Statistics How To).
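The decision rule can also be checked in code by pulling the slope's p-value and a 95% confidence interval out of a fitted model. This sketch uses base R on synthetic data, so the numbers it prints are illustrative only.

```r
# Synthetic data (made-up values, not the Airbnb listings)
toy <- data.frame(
  distToTimesSquare = c(0.5, 1, 2, 3, 5, 8),
  price = c(300, 250, 220, 180, 150, 100)
)

fit <- lm(price ~ distToTimesSquare, data = toy)

# Coefficient table: estimate, std. error, t statistic, p-value
coefs <- summary(fit)$coefficients
p_value <- coefs["distToTimesSquare", "Pr(>|t|)"]

# 95% confidence interval for the slope
ci <- confint(fit, "distToTimesSquare", level = 0.95)

# Reject the null hypothesis of no relationship at the 5% level?
reject_null <- p_value < 0.05

print(p_value)
print(ci)
print(reject_null)
```

A confidence interval that excludes 0 leads to the same conclusion as a p-value below 0.05.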

You can read more about Linear Regression at CMSC 320 Linear Regression Lecture Notes by Professor Hector Corrada Bravo.

ML Model

Conclusion